Energy Performance of Floating-Point Matrix Multiplication on FPGAs

Authors

  • Ling Zhuo
  • Viktor K. Prasanna
Abstract

Floating-point matrix multiplication is a basic kernel in scientific computing. It has been shown that implementations of this kernel on FPGAs can achieve high sustained performance [1]. However, to the best of our knowledge, existing work on FPGA-based floating-point matrix multiplication considers the optimization of latency or area only. In this paper, we analyze the impact of various parameters on the energy dissipation of two floating-point matrix multiplication algorithms developed by us. Due to space limitations, the algorithms are not presented here; details of the algorithms (Algorithm 1 and Algorithm 2) can be found in [1]. We identify several parameters that affect the energy dissipation of the algorithms. These include the number of pipeline stages within the floating-point units, the block size for block matrix multiplication, and the number of PEs configured on the device. These parameters give rise to a large implementation space for the algorithms. This implementation space is explored using domain-specific modeling developed by us, a high-level approach to quickly evaluate the energy performance of various designs [2]. For small problem sizes, zero padding is required if deeply pipelined floating-point units are employed. Therefore, even though these implementations achieve higher clock speeds, they dissipate more energy than implementations with moderately pipelined floating-point units. For large problem sizes, the effect of zero padding is negligible; in this case, using floating-point units that have the highest “throughput per unit area” achieves the best energy dissipation performance. When the matrix size is large, block matrix multiplication needs to be employed. With small block sizes, deeply pipelined implementations dissipate much more energy than moderately pipelined implementations because of zero padding. The impact of zero padding dwindles as the block size increases. Therefore, the optimal block size depends on both the problem size and the floating-point units employed. We implemented our algorithms using Xilinx ISE 5.2i [3], with the Xilinx Virtex-II Pro XC2VP125 as our target device. Low-level power simulation using XPower [3] was performed to verify the energy results of the high-level modeling. In our experiments, a series of matrices was fed into the architecture consecutively so that zero padding is not needed. Our experiments show that the energy dissipation estimated by the high-level modeling is within 12% of the actual values obtained by the low-level simulation. Theoretically, Algorithm 2 uses...
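
To make the blocked computation concrete, the following C sketch shows generic block matrix multiplication with zero padding to a multiple of the block size. It is a minimal software illustration, not the authors' FPGA architecture: in their designs the zero padding stems from the pipeline depth of the floating-point units rather than from block alignment, and the matrix size n = 50 and block size b = 16 used here are hypothetical values. In both settings, however, padded zeros consume multiply-add operations without contributing to the result, which is the source of the extra energy discussed above.

/*
 * Illustrative sketch: block matrix multiplication C = A x B with zero
 * padding to a multiple of the block size b.  The padded operation count
 * versus the useful one shows how padding inflates the work performed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of the block size b. */
static size_t padded_size(size_t n, size_t b)
{
    return ((n + b - 1) / b) * b;
}

/* Block matrix multiplication on np x np zero-padded matrices. */
static void block_matmul(const double *a, const double *bmat, double *c,
                         size_t np, size_t b)
{
    memset(c, 0, np * np * sizeof(double));
    for (size_t ii = 0; ii < np; ii += b)
        for (size_t kk = 0; kk < np; kk += b)
            for (size_t jj = 0; jj < np; jj += b)
                /* Multiply one b x b block pair; padded entries are zero
                 * and contribute nothing to the result, only wasted work. */
                for (size_t i = ii; i < ii + b; i++)
                    for (size_t k = kk; k < kk + b; k++)
                        for (size_t j = jj; j < jj + b; j++)
                            c[i * np + j] += a[i * np + k] * bmat[k * np + j];
}

int main(void)
{
    size_t n = 50;   /* actual problem size (hypothetical value) */
    size_t b = 16;   /* block size (hypothetical value)          */
    size_t np = padded_size(n, b);

    double *a = calloc(np * np, sizeof(double));
    double *bmat = calloc(np * np, sizeof(double));
    double *c = malloc(np * np * sizeof(double));

    /* Fill only the n x n region; the padding stays zero. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            a[i * np + j] = 1.0;
            bmat[i * np + j] = 2.0;
        }

    block_matmul(a, bmat, c, np, b);

    /* Padding overhead: ratio of performed to useful multiply-adds. */
    double overhead = (double)(np * np * np) / (double)(n * n * n);
    printf("n=%zu padded to %zu, operation overhead = %.2fx\n", n, np, overhead);

    free(a); free(bmat); free(c);
    return 0;
}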

Similar Articles

Sparse Matrix-Vector Multiplication on FPGAs

Floating-point Sparse Matrix-Vector Multiplication (SpMXV) is a key computational kernel in scientific and engineering applications. The poor data locality of sparse matrices significantly reduces the performance of SpMXV on general-purpose processors, which rely heavily on the cache hierarchy to achieve high performance. The abundant hardware resources on current FPGAs provide new opportunities to...

HERA: A Reconfigurable and Mixed-Mode Parallel Computing Engine on Platform FPGAs

The high price, long design and development cycles, programming difficulty and high maintenance cost of supercomputers limit their range of potential applications. Recent advances in Field-Programmable Gate Arrays (FPGAs) have made feasible the development of high-performance and programmable parallel systems on a programmable chip (PSOPC). PSOPCs yield high performance at low cost for many para...

White Paper: Designing and Using FPGAs for Double-Precision Floating-Point Math

Floating-point arithmetic is used extensively in many applications across multiple market segments. These applications often require a large number of calculations and are prevalent in financial analytics, bioinformatics, molecular dynamics, radar, and seismic imaging, to name a few. Apart from integer and single-precision 32-bit floating-point math, many applications demand higher precision, f...

Mapping Sparse Matrix-Vector Multiplication on FPGAs

Higher peak performance on Field Programmable Gate Arrays (FPGAs) than on microprocessors was shown for sparse matrix-vector multiplication (SpMxV) accelerator designs. However, due to the frequent memory movement in SpMxV, system performance is heavily affected by memory bandwidth and overheads in real applications. In this paper, we introduce an innovative SpMxV Solver, designed for FPGAs, SSF...

The Algorithms for FPGA Implementation of Sparse Matrices Multiplication

Compared with dense matrix multiplication, the real performance of sparse matrix multiplication on CPUs is roughly 5–100 times lower when expressed in GFLOPs. For sparse matrices, microprocessors spend most of the time comparing matrix indices rather than performing floating-point multiply and add operations. For 16-bit integer operations, such as index comparisons, the computational power of t...

Publication year: 2004